Abstract

It is well known [Knu97, pages 399-400] that in a binary tree the external path length
minus the internal path length is exactly 2n - 2, where n is the number of external nodes.
We show that a generalization of the formula holds for compacted tries, replacing the role
of paths with the notion of extent, and the value 2n - 2 with the trie measure, an estimate
of the number of bits that are necessary to describe the trie.

1 Introduction 

The well-known formula [Knu97, pages 399-400]

E = I + 2n - 2, 

where n is the number of external nodes, relates the external path length E of a binary tree (the
sum of the lengths of the paths leading to external nodes) with the internal path length I (the
sum of the lengths of the paths leading to internal nodes).^1
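The classical formula can be checked mechanically on small trees. The following is a minimal sketch, assuming a binary tree is modelled as `None` (the empty tree, i.e. an external node) or a pair `(left, right)`; the function name `path_lengths` and the example tree are illustrative and do not come from the text.

```python
# A binary tree is modelled as None (the empty tree, drawn as an external
# node) or a pair (left, right) of binary trees (an internal node).

def path_lengths(tree, depth=0):
    """Return (E, I, n): external path length, internal path length and
    number of external nodes of `tree`, whose root sits at `depth`."""
    if tree is None:
        return depth, 0, 1          # one external node at this depth
    left, right = tree
    e0, i0, n0 = path_lengths(left, depth + 1)
    e1, i1, n1 = path_lengths(right, depth + 1)
    # The internal root contributes its own depth to I.
    return e0 + e1, i0 + i1 + depth, n0 + n1

# Hypothetical example: a linear tree with 4 external nodes.
example = (None, (None, (None, None)))
E, I, n = path_lengths(example)
assert E == I + 2 * n - 2
```

Here E = 9, I = 3 and n = 4, so that E - I = 6 = 2n - 2.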

A compacted (binary) trie is a binary tree where each node (both internal and external) is
endowed with a (binary) string (possibly empty) called compacted path. For a compacted trie,
if we extend in the natural way the values of E and I, the formula is no longer valid. In this
note we provide a suitable generalization of the formula, using the definition of extent of a node
(which collapses to the definition of path when all compacted paths are empty). We show that
E = I + T, where E is the sum of the lengths of external extents, I is the sum of the lengths
of internal extents, and T is the trie measure, which approximates the number of bits that are
necessary to describe the trie. If all compacted paths are empty the trie measure is 2n - 2, so
our equation is a generalization of the classical result. We also provide a generalization to the
case of non-binary tries.

2 Definitions 

We work out our definitions from scratch closely following Knuth's, as the notation that can be 
found in the literature is not always consistent. 



^1 The formula actually reported by Knuth is slightly different (E = I + 2n) because in his notation n is the
number of internal nodes, which is equal to the number of external nodes minus one. As we will see, for compacted
tries the number of external nodes is equal to the size of the set of strings represented by the trie, and so it is a
more natural candidate for the letter n.
Binary trees. A binary tree is either the empty binary tree or a pair of binary trees (called
the left subtree and the right subtree) [Knu97, page 312].^2

A binary tree can be represented as a rooted tree^3 in which nodes are either internal or
external. The empty binary tree is represented by a single external root node. Otherwise, a
binary tree is represented by an internal root node connected to the representations of the left
and right subtree by two edges labelled 0 and 1. Note that external nodes have no children,
whereas internal nodes always have exactly two children.^4

Compacted binary tries. A compacted binary trie is either a binary string, called a compacted 
path, or a binary string endowed with a pair of binary tries (called the left subtrie and the right 
subtrie). Equivalently, a compacted binary trie can be seen as a labelling of the nodes of a binary 
tree with compacted paths. 

Given a nonempty prefix-free set of strings S ⊆ 2*, the associated compacted binary trie is:

• the only string in the set, if |S| = 1;

• otherwise, let p be the longest common prefix of the strings in S; then, the trie associated with
S is given by the string p and by the pair of tries associated with the sets { x ∈ 2* | pbx ∈ S },
for b = 0, 1.
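The recursive definition above translates almost literally into code. The following is a minimal sketch, assuming strings are Python `str` objects over the characters `0` and `1`, that an external node is represented by its string, and that an internal node is a triple `(p, left, right)`; the name `build_trie` is illustrative. The example set is the toy set of Figure 1.

```python
import os

def build_trie(strings):
    """Compacted binary trie of a nonempty prefix-free set of '0'/'1' strings.
    An external node is a plain string (its compacted path); an internal
    node is a triple (p, left_subtrie, right_subtrie)."""
    strings = list(strings)
    if len(strings) == 1:
        return strings[0]
    # Longest common prefix of all strings (computed character by character).
    p = os.path.commonprefix(strings)
    # Prefix-freeness guarantees that every string is strictly longer than p,
    # and maximality of p guarantees that both subsets are nonempty.
    left = [s[len(p) + 1:] for s in strings if s[len(p)] == "0"]
    right = [s[len(p) + 1:] for s in strings if s[len(p)] == "1"]
    return (p, build_trie(left), build_trie(right))

TOY = ["001001010", "00100110100100010", "001001101001001"]
trie = build_trie(TOY)
```

For the toy set the root is labelled `001001`, its left child is the external node `10`, and its right child is an internal node labelled `0100100` with external children `10` and the empty string.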

A compacted binary trie can be represented as a rooted tree in which, as in the case of binary
trees, nodes are either internal or external. A single string is represented by a single external
root node labelled by the string. Otherwise, a string and a pair of subtries are represented by an
internal root node labelled by the string, connected to the representations of the first and second
subtrie by two edges labelled 0 and 1 (see Figure 2). From this representation, the set S can be
recovered by looking at the labelled paths going from the root to external nodes.

Given a node α of the trie (see again Figure 2):

• the extent of α is the longest common prefix of the strings represented by the external
nodes that are descendants of α;

• the compacted path of α, denoted by c_α, is the string labelling α;

• the name of α is the extent of α deprived of its suffix c_α.

We will use the name internal extent (external extent, resp.) for the extent of an internal
(external, resp.) node.
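The three notions can be computed in a single preorder visit. The sketch below assumes the same hypothetical nested representation used for illustration elsewhere (an external node is a string, an internal node is a triple `(c, left, right)`), and hard-codes the trie of the toy set of Figure 1; the function name `nodes` is illustrative.

```python
def nodes(trie, name=""):
    """Yield (kind, name, compacted_path, extent) for every node in preorder.
    The name of the root is empty; the name of a child reached through the
    edge labelled b is the extent of its parent followed by b."""
    if isinstance(trie, str):
        yield ("external", name, trie, name + trie)
        return
    c, left, right = trie
    extent = name + c                # extent = name followed by compacted path
    yield ("internal", name, c, extent)
    yield from nodes(left, extent + "0")
    yield from nodes(right, extent + "1")

# Trie of the toy set of Figure 1, written out by hand.
TOY_TRIE = ("001001", "10", ("0100100", "10", ""))

for kind, name, path, extent in nodes(TOY_TRIE):
    print(f"{kind:8}  name={name!r}  c={path!r}  extent={extent!r}")
```

The external extents are exactly the strings of S, which is how the set is recovered from the tree representation.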

A data-aware measure. Consider the compacted trie associated with a nonempty set S ⊆ 2*.
We define the trie measure of S [GHSV07] as

\[ T(S) = \sum_{\alpha} (|c_\alpha| + 1) - 1 = O(n\ell), \]

^2 We remark that the definition we use (a slight abstraction on Knuth's) is the simplest and most correct
from a combinatorial viewpoint, but might sound unfamiliar. An alternative commonly found description says
that a binary tree is given by a node with a left and a right subtree, either of which might be empty; the latter
definition, however, does not account for the empty binary tree, which is essential in making the
left-child-right-sibling isomorphism with ordered forests work (see again [Knu97, pages 334-335]).

^3 An acyclic connected graph with a chosen node (the root). As observed by Knuth [Knu97, page 312], a tree
(in the graph-theoretical sense) and a binary tree are two completely different combinatorial objects.

^4 We remark that it is common to forget about external nodes altogether and consider only internal nodes as
"true" nodes of the binary tree. In this setting, there are nodes with no children, nodes with a single child (left or
right), and nodes with two children. As noted by Knuth, handling external nodes explicitly makes the structure
"more convenient to deal with". In our case, external nodes are essential in the very definition of E.
s_0   001001010
s_1   00100110100100010
s_2   001001101001001

Figure 1: A toy example set S.




Figure 2: The rooted-tree representation of the compacted trie associated with the set S of
Figure 1 and the related names. Arrows display the direction from the root to the external
nodes.
where the summation ranges over all nodes of the trie, n = |S| and ℓ is the average length of
the elements of S. In fact, T(S) is the number of edges of the standard (non-compacted) trie
associated with S.
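The identity between the trie measure and the number of edges of the standard trie can be checked directly. The following is a minimal sketch under the assumption that strings are Python `str` objects over `0` and `1`; both function names are illustrative. The number of edges of the standard trie is computed as the number of distinct nonempty prefixes of the strings in S.

```python
import os

def trie_measure(strings):
    """T(S): sum over all nodes of (|c_alpha| + 1), minus 1."""
    def node_sum(strs):
        if len(strs) == 1:
            return len(strs[0]) + 1           # external node
        p = os.path.commonprefix(strs)        # compacted path of this node
        total = len(p) + 1
        for b in "01":
            total += node_sum([s[len(p) + 1:] for s in strs if s[len(p)] == b])
        return total
    return node_sum(list(strings)) - 1

def standard_trie_edges(strings):
    """Edges of the non-compacted trie: one per distinct nonempty prefix."""
    return len({s[:k] for s in strings for k in range(1, len(s) + 1)})

TOY = ["001001010", "00100110100100010", "001001101001001"]
assert trie_measure(TOY) == standard_trie_edges(TOY) == 21
```

For the toy set of Figure 1 the five compacted paths have lengths 6, 2, 7, 2 and 0, so T(S) = (7 + 3 + 8 + 3 + 1) - 1 = 21, well below nℓ = 41.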

This measure is directly related to the number of bits required to encode the compacted trie
associated with S explicitly: indeed, to do this we just need to encode the trie structure (as a
binary tree) and to write down in preorder all the c_α's. Since there are n external nodes (hence
n - 1 internal nodes), writing a concatenation of the c_α's requires T(S) - 2n + 2 bits; then we
need $\log\binom{T(S)}{2n-2}$ additional bits to store the starting point of each c_α, whereas the trie structure
needs just 2n - 2 bits (e.g., using Jacobson's representation for binary trees [Jac89]). All in all,
the space required to store the trie is



\[ T(S) + \log \binom{T(S)}{2n-2}. \]



More precisely, the above number of bits is sufficient to write every trie with n external nodes
and measure T(S), and it is necessary for at least one such trie (whichever representation is
used) [FGG+08].
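As a rough numerical illustration of the space bound, the sketch below evaluates it for given values of T(S) and n. It assumes that the logarithm is in base 2 (the bound counts bits) and uses a hypothetical function name, `space_bound_bits`; the breakdown into three terms follows the accounting in the paragraph above.

```python
from math import comb, log2

def space_bound_bits(T, n):
    """Return the three terms of the bound and their total, in bits.
    Assumption: logarithms are base 2."""
    paths = T - 2 * n + 2                 # concatenation of the compacted paths
    starts = log2(comb(T, 2 * n - 2))     # starting point of each compacted path
    structure = 2 * n - 2                 # binary tree structure
    return paths, starts, structure, paths + starts + structure

# Toy set of Figure 1: T(S) = 21 and n = 3.
paths, starts, structure, total = space_bound_bits(21, 3)
print(paths, structure, round(starts, 2), round(total, 2))
```

For the toy set the compacted paths take 17 bits and the structure 4 bits, so the total exceeds T(S) = 21 only by the logarithmic term.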
We start by generalizing the internal path formula for binary trees to an internal extent formula
for compacted binary tries:

Theorem 1 Let S be a nonempty prefix-free set of n binary strings with average length ℓ, and
consider the compacted binary trie associated with S. Let E be the sum of the lengths of the
external extents (equivalently: E = nℓ, the sum of the lengths of the strings in S), I the sum of
the lengths of the internal extents, and T the trie measure of S. Then,

E = I + T.

Proof. We prove the theorem by induction on n. The theorem is obviously true for n = 1, as in
this case E = |c_α|, I = 0 and T = |c_α| + 1 - 1 = |c_α|. Consider now the case of a trie with root
α and subtries with their values n_0, n_1, E_0, E_1, I_0, I_1, T_0, and T_1. Then, using the definitions,
we have

\[
\begin{aligned}
E &= (E_0 + E_1) + (|c_\alpha| + 1)(n_0 + n_1) \\
I &= (I_0 + I_1) + (|c_\alpha| + 1)(n_0 - 1 + n_1 - 1) + |c_\alpha| \\
  &= (I_0 + I_1) + (|c_\alpha| + 1)(n_0 + n_1 - 1) - 1 \\
T &= (T_0 + 1) + (T_1 + 1) + (|c_\alpha| + 1) - 1 = T_0 + T_1 + |c_\alpha| + 2 \\
n &= n_0 + n_1.
\end{aligned}
\]

Adding the equations E_j = I_j + T_j for j = 0, 1 (which hold by inductive hypothesis) we have

\[ E_0 + E_1 = I_0 + I_1 + T_0 + T_1. \]

We add (|c_α| + 1)(n_0 + n_1) to both sides, getting

\[
\begin{aligned}
E_0 + E_1 + (|c_\alpha| + 1)(n_0 + n_1) &= I_0 + I_1 + (|c_\alpha| + 1)(n_0 + n_1 - 1) + T_0 + T_1 + |c_\alpha| + 1 \\
E_0 + E_1 + (|c_\alpha| + 1)(n_0 + n_1) &= I + 1 + T - 1,
\end{aligned}
\]

which entails the thesis. ∎

As noted in the introduction, when all compacted paths are empty E is equal to the external path
length, I is equal to the internal path length, and the trie measure is exactly (Σ_α 1) - 1 = 2n - 2.
Thus, the internal extent formula is truly a generalization of the internal path formula.
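The internal extent formula lends itself to an empirical check. The following is a minimal sketch, assuming strings over `0` and `1` represented as Python `str`; the names `extent_sums` and `random_prefix_free` are illustrative. Random prefix-free sets are generated here by prepending distinct fixed-length heads to random suffixes, which is just one convenient way of guaranteeing prefix-freeness.

```python
import os
import random

def extent_sums(strs, name_len=0, top=True):
    """Return (E, I, T) for the compacted binary trie of a prefix-free set.
    Internally the third component accumulates sum(|c_alpha| + 1); the
    final '- 1' of the trie measure is applied once, at the top level."""
    if len(strs) == 1:
        E, I, W = name_len + len(strs[0]), 0, len(strs[0]) + 1
    else:
        p = os.path.commonprefix(strs)
        extent = name_len + len(p)        # extent length of this internal node
        E, I, W = 0, extent, len(p) + 1
        for b in "01":
            sub = [s[len(p) + 1:] for s in strs if s[len(p)] == b]
            e, i, w = extent_sums(sub, extent + 1, top=False)
            E, I, W = E + e, I + i, W + w
    return (E, I, W - 1) if top else (E, I, W)

def random_prefix_free(n, rng, k=8):
    """n strings with distinct k-bit heads and random suffixes (prefix-free)."""
    heads = rng.sample(range(2 ** k), n)
    return [format(h, f"0{k}b") + "".join(rng.choice("01") for _ in range(rng.randrange(6)))
            for h in heads]

TOY = ["001001010", "00100110100100010", "001001101001001"]
assert extent_sums(TOY) == (41, 20, 21)       # 41 = 20 + 21

rng = random.Random(0)
for _ in range(100):
    E, I, T = extent_sums(random_prefix_free(rng.randrange(1, 40), rng))
    assert E == I + T
```

On the toy set of Figure 1 the internal extents have lengths 6 and 14, so I = 20, while E = 9 + 17 + 15 = 41 and T = 21.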

4 A simple application 

We were led to the equation E = I + T by the problem of bounding the average length of an
internal extent in terms of the average length of an external extent, that is, in terms of ℓ, the
average length of the strings in S. This bound can now be easily obtained:

Corollary 1 Let S be a set of binary strings with |S| ≥ 2. With the notation of Theorem 1,

I/(n - 1) ≤ ℓ - 3/2 + 1/n.

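Proof. We just divide both members of the internal extent equation by n, recalling that
T = 2n - 2 + Σ_α |c_α|:

\[
\begin{aligned}
\ell = \frac{E}{n} &= \frac{I}{n} + \frac{2n - 2 + \sum_\alpha |c_\alpha|}{n} \\
&= \frac{I}{n-1} + \frac{3}{2} - \frac{1}{n} - \frac{I - (n-1)\bigl(n/2 - 1 + \sum_\alpha |c_\alpha|\bigr)}{n(n-1)} \\
&\ge \frac{I}{n-1} + \frac{3}{2} - \frac{1}{n}.
\end{aligned}
\]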

To see why the last bound is true, note that in a trie with n - 1 internal nodes the contribution
to I of the edges (i.e., excluding the compacted paths) is at most (n - 1)(n - 2)/2 (the worst case
is a linear trie). On the other hand, the contribution of compacted paths to each internal extent
cannot be more than Σ_α |c_α|, so the overall contribution cannot be more than (n - 1) Σ_α |c_α|.
We conclude that

\[ I \le (n-1)\Bigl(\frac{n-2}{2} + \sum_\alpha |c_\alpha|\Bigr). \]

∎

Note that the bound is essentially tight, as in a linear trie with empty compacted paths
E = n(n + 1)/2 - 1 and I = (n - 2)(n - 1)/2, so E/n - I/(n - 1) = 3/2 - 1/n.

5 A generalization to non-binary tries 

Given an alphabet Σ, a compacted trie over Σ is defined as follows: it is either a single string
x ∈ Σ*, or a string x ∈ Σ* together with a subset X ⊆ Σ, with |X| > 1, endowed with a function
C that assigns a compacted trie over Σ to each element of X.

Given a nonempty prefix-free set of strings S ⊆ Σ*, the associated compacted trie over Σ is:

• the only string in the set, if |S| = 1;

• otherwise, let p be the longest common prefix of the strings in S; then, the trie associated
with S is given by p, the set X ⊆ Σ of all a ∈ Σ such that pa is the prefix of some
string in S, and by the function C mapping a to the compacted trie associated with the set
{ x ∈ Σ* | pax ∈ S }.
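The construction over a general alphabet differs from the binary one only in that the children are indexed by the set X. A minimal sketch follows, assuming symbols are single characters of Python strings, the function C is modelled as a `dict`, and an external node is represented by its string; the name `build_trie` and the example set are illustrative.

```python
import os

def build_trie(strings):
    """Compacted trie over an arbitrary alphabet. An external node is a
    string; an internal node is a pair (p, C) where C maps each symbol a
    in X to the subtrie of the strings that continue p with a."""
    strings = list(strings)
    if len(strings) == 1:
        return strings[0]
    p = os.path.commonprefix(strings)
    X = sorted({s[len(p)] for s in strings})      # symbols following p in S
    C = {a: build_trie([s[len(p) + 1:] for s in strings if s[len(p)] == a])
         for a in X}
    return (p, C)

# Hypothetical prefix-free set over the alphabet {a, b, c, d}.
example = build_trie(["abc", "abd", "b", "ca"])
```

In this example the root has an empty compacted path and three children; the child for `a` is an internal node labelled `b` with two external children.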






Similarly to what happens for compacted binary tries, a compacted trie over Σ can be
represented as a rooted tree where each node is labelled by a (possibly empty) string over Σ and
internal nodes have at most |Σ| (but not less than two) children, each associated with a distinct
symbol of Σ. The notation of Figure 2 carries over easily, and the definition of trie measure is
extended in the natural way.

We now want to generalize the internal extent formula (Theorem 1) to non-binary tries.

Theorem 2 Let S be a nonempty prefix-free set of n strings over an alphabet with σ symbols,
and consider the compacted trie associated with S. For each d = 0, ..., σ, let Y(d) be the sum of
the lengths of the extents of nodes with d children, n(d) be the number of such nodes, and T be
the trie measure. Then,

\[ E = \sum_{d=2}^{\sigma} (d-1)\,Y(d) + T. \]

Proof. By induction on the number of nodes. This is true for a one-node trie; for the induction
step, suppose that the root of a trie has a compacted path of length c, and h subtries (2 ≤ h ≤ σ);
for the i-th subtrie, by induction hypothesis, since E = Y(0) we have

\[ Y_i(0) = \sum_{d=2}^{\sigma} (d-1)\,Y_i(d) + T_i. \tag{1} \]

Observe that, for every d = 0, 2, 3, ..., σ,

\[ Y(d) = \sum_{i=1}^{h} Y_i(d) + (c+1)\Bigl([d = h] + \sum_{i=1}^{h} n_i(d)\Bigr) - [d = h], \]

where we used Iverson's notation.^5 Moreover

\[ n(d) = [d = h] + \sum_{i=1}^{h} n_i(d), \]

so

\[ Y(d) = \sum_{i=1}^{h} Y_i(d) + (c+1)\,n(d) - [d = h]. \]

Further

\[ T = \sum_{i=1}^{h} T_i + h + c. \]

Summing (1) memberwise, we obtain

\[ \sum_{i=1}^{h} Y_i(0) = \sum_{d=2}^{\sigma} (d-1)\Bigl(\sum_{i=1}^{h} Y_i(d)\Bigr) + \sum_{i=1}^{h} T_i, \]

that is equivalent to

\[ Y(0) - (c+1)\,n(0) + [0 = h] = \sum_{d=2}^{\sigma} (d-1)\bigl(Y(d) - (c+1)\,n(d) + [d = h]\bigr) + T - h - c, \]

^5 For a given Boolean predicate φ we let [φ] be 0 if φ is false, 1 if φ is true [Knu92].
hence, since [0 = h] = 0,

\[ Y(0) = \sum_{d=2}^{\sigma} (d-1)\,Y(d) - (c+1)\Bigl(\sum_{d=2}^{\sigma} (d-1)\,n(d) - n(0)\Bigr) + h - 1 + T - h - c. \]

Since Σ_{d=2}^{σ} (d - 1) n(d) = n(0) - 1, we have

\[ Y(0) = \sum_{d=2}^{\sigma} (d-1)\,Y(d) + T. \]

∎
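The generalized formula can also be checked numerically. The following is a minimal sketch, assuming symbols are single characters of Python strings; the names `extent_sums` and `check` and the two example sets are illustrative. The dictionary `Y` plays the role of Y(d), with Y[0] equal to E.

```python
import os
from collections import Counter

def extent_sums(strs, name_len, Y):
    """Add to Y[d] the extent lengths of the nodes with d children and
    return the sum of (|c_alpha| + 1) over all nodes of the trie."""
    if len(strs) == 1:
        Y[0] += name_len + len(strs[0])           # external extent
        return len(strs[0]) + 1
    p = os.path.commonprefix(strs)
    extent = name_len + len(p)
    symbols = sorted({s[len(p)] for s in strs})   # the set X of this node
    Y[len(symbols)] += extent
    w = len(p) + 1
    for a in symbols:
        w += extent_sums([s[len(p) + 1:] for s in strs if s[len(p)] == a], extent + 1, Y)
    return w

def check(strs):
    """Return (E, sum over d >= 2 of (d - 1) * Y(d), T)."""
    Y = Counter()
    T = extent_sums(list(strs), 0, Y) - 1
    weighted = sum((d - 1) * Y[d] for d in list(Y) if d >= 2)
    return Y[0], weighted, T

for S in (["abc", "abd", "b", "ca"], ["xxa", "xxb", "xxcy", "xxcz"]):
    E, weighted, T = check(S)
    assert E == weighted + T
```

In the second example the root has three children and extent length 2, and the only other internal node has two children and extent length 3, so the weighted sum is 2·2 + 1·3 = 7, with T = 7 and E = 14.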